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Abstract 

We present a GPU accelerated GUDA-C implementation of the Barnes Hut (BH) tree code for calculating the gravita¬ 
tional potential on octree adaptive meshes. The tree code algorithm is implemented within the FLASH4 adaptive mesh 
refinement (AMR) code framework and therefore fully MPI parallel. We describe the algorithm and present test results 
that demonstrate its accuracy and performance in comparison to the algorithms available in the current FLASH4 version. 
We use a MacLaurin spheroid to test the accuracy of our new implementation and use spherical, collapsing cloud cores 
with effective AMR to carry out performance tests also in comparison with previous gravity solvers. Depending on the 
setup and the GPU/GPU ratio, we find a speedup for the gravity unit of at least a factor of 3 and up to 60 in comparison 
to the gravity solvers implemented in the FLASH4 code. We find an overall speedup factor for full simulations of at 
least factor 1.6 up to a factor of 10. 

Keywords: gravitation, hydrodynamics, methods: numerical, stars: formation 


1. Introduction 


Self-Gravity is a key phenomena in many Astrophysical 
Simulations, hence the solution of Poisson’s equation of the 
gravitational potential for a given density distribution 
p{x) 

V'^(j){x) = 47rGp(a;) (1) 

is one of the main functions in these simulations. The 
direct method for evaluating the gravitational potential at 
position Xi of a point mass rrii requires the evaluation of 
all pairwise interactions in the system. 


^{Xi) 


iv^j 


( 2 ) 


where, toj is the mass of the jth particle and r, and rj are 
the position vectors of particle i and j. While the direct 
method with its 0{N‘^) arithmetic complexity is concep¬ 
tually simple, it is obviously unsuitable for larger systems. 
Luckily several heuristic algorithms are available which re¬ 
quire fewer operations within acceptable error bounds. 

In general, one can distinguish between the grid-based al¬ 
gorithms and the tree-based algorithms for calculating the 
gravitational potential. We use the FLASH code package 


(Fryxell et ah, 2000 Fisher et al. 20081 and three of the 


implemented algorithms: the multipole solver, the multi¬ 
grid solver and the tree solver. 
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The Multipole Poisson solver is based on a multipolar ex¬ 
pansion of the mass distribution around a certain center of 
expansion. Both accuracy and runtime can be controlled 
via the multipole cutoff value Imax- The multipole ap¬ 
proach is by nature appropriate for systems with spherical 
mass distributions, such that a spherical harmonic expan¬ 
sion can be expected to reach high accuracy after a small 
number of terms. (Gouch et al., 2013) 

The Multigrid Poisson solver is a modified version of the 
direct multigrid algorithm of Huang & Greengard adapted 
to the FLASH4 grid structure (Ricker 2008). The Multi¬ 
grid Poisson solver is appropriate for general mass distri¬ 
butions. 

The Tree Poisson solver is based on the Barnes & Hut 
tree code where the implemented octree is an extension of 
the AMR mesh tree down to the individual cells (based on 
the FLASH4.2.2 release). The tree Poisson solver is ap¬ 
propriate for general mass distributions. In the following 
sections we refer to the Barnes & Hut tree Poisson solver 
as CPU-BH tree solver. 

Although the heuristic solvers reduce the computation time 
for the gravity module, they still require a large amount of 
computation time, often much more than the integration 
of the MHD equations. 

In this paper, we developed a GPU accelerated Barnes 
& Hut tree Poisson solver for astrophysical applications 


based on the GUDA runtime library (NVIDIA 2015 Nick- 


oils et al. 2008). Our GPU accelerated Barnes & Hut tree 


code implements tree walks as well as tree builds solely on 
the GPU-Device. By accelerating the gravity module with 
the GPU accelerated Barnes & Hut tree code, we measured 
a performance improvement of more than factor 20. In the 
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following sections we refer to our GPU accelerated Barnes 
& Hut tree Poisson solver as GPU-BH tree solver. 


2. Algorithm 


The Barnes & Hut tree algorithm for three-dimensions 
works by grouping “particles” using an octree structure 
(Barnes and Hut 19861. A single cube containing all par¬ 


ticles surrounds the system. This cube is then recursively 
divided into eight sub cells with each containing their own 
set of cells. The tree building method continues down in 
scale until only one particle is left for every sub cell. The 
tree construction can be done using a bottom-up i.e. in¬ 
serting one particle at a time or a top-down approach by 
sorting the particles across divisions. Both methods take 
0(N log N) time. For each tree node, the total mass and 
the center of mass is calculated. The force on a particle in 
the system is evaluated by “walking” down the tree. At 
each level, every node is tested against a test particle if it 
is distant enough for a force evaluation. If the node is too 
close, it is “opened” and the 8 children are selected for the 
same procedure. Various criteria exist to test whether a 
particle is sufficiently distant for a force evaluation. The 
most common and simplest criterion is based on the open¬ 


ing angle parameter 9 (Barnes and Hut, 19861. If the size 


of a node is I and the distance of the particle from the cell 
center of mass is d, the node can be accepted for a force 
evaluation if 

d > 1/0. (3) 

Smaller values of 9 lead to a higher accuracy cause by more 
node selected for opening. Typically a 9 value of 1 leads 
to errors around 1% (Hernquist, |1987[ ). In some cases, 
in which the center of mass is near the edge of a node, 
the basic criterion described in Eq. can produce large 
errors (Salmon and Warren 19931 Various alternatives are 


given in literature to avoid this problem. (Salmon and 


Warren 1993 Barnes, 19941 Here, we adopt the opening 


angle parameter described by Barnes (19941 with 


d > 1/9 + S, 


(4) 


where 6 is the distance between the center of mass of the 
node and the geometric center (see fig. . This criterion 
guarantees that if the center of mass is near the node’s 
edge, only positions removed by an extra distance S use 
the cell for a force evaluation. In case the center of mass 
is near the node’s center, the old criterion (Eq. is used. 



Figure 1: The Geometry of the Barnes S>c Hut opening criterion used 
in the GPU-BH tree code. 


particles in the tree does not lead to simple load balanced 
domain decompositions. To distribute the work to many 
independent processors, the particles and the tree must be 
divided in a balanced way. Possible parallelization meth¬ 
ods are described in ( Salmon[|1991 ) and (Dubinski 1996). 
The basic workflow of one parallel Barnes &: Hut algo¬ 
rithm is outlined in fig. The parallel Barnes & Hut 
algorithm starts with a domain decomposition to achieve 
a load balanced particle distribution. The decomposition 
can be carried out using any suitable algorithm e. g. the 
method of orthogonal recursive bisection the method of 
orthogonal recursive bisection (Dubinski 1996[), the Mor¬ 


ton space filling curve (Bedorf et al. 


Hilbert space filling curve (Bedorf et al. 


2012) or a Peano- 
2014| )p] ven 


a load-balanced distribution of particles among different 
processors, a BH-tree can be constructed using the locally 
stored set of particles (fig. Construct local trees). If 
every processor had a copy of the complete domain, the 
force calculation could be carried out now. Unfortunately, 
the amount of memory required for a full copy of the do¬ 
main data is prohibitive and thus, we use a smaller but 
sufficiently large enough sub-tree. 

The LET (locally essential trees) approach assumes that 
only a subset of tree nodes is necessary since the BH- 
tree can be pruned by using the opening angle criterion 
(Salmon 1991). Applying the opening angle criterion to 
the entire group of particles on a donating processor, a 
subset of nodes can be selected that are necessary for a 
successful force evaluation on a different processor. The 
selected nodes are then sent to the target processor (fig. 
Exchange tree nodes). On the target processor, the exist¬ 
ing local tree is enlarged in this way by a pruned version 
of the trees resident on other processors (fig. Exchange 
tree nodes). This local essential tree is then traversed and 
the gravitational potential for the particles is evaluated 
(fig. Tree walk). 


2.1. The Parallel Tree-Algorithm 

The parallelization of the Barnes & Hut algorithm is 
not obvious, since the inhomogeneous distribution of the 


3. The GPU Barnes-Hut Tree Code for FLASH4 


The approaches from (Salmon, 1991) and from (Dubin¬ 


ski 1996) have been modified and implemented into the 


^We use the term “particles” as a proxy for any mass-item, in¬ 
cluding grid cells. 


^The FLASH code implements the Morton space filling curve for 
domain decomposition. 
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Figure 2: Flowchart of the parallel Barnes Sz Hut algorithm as it 
is implemented with our GPU accelerated gravity solver into the 
FLASH code. Given an initial domain decomposition, we construct 
a BH tree covering the locally stored data cells. The local trees 
are traversed and essential tree nodes are exchanged between the 
processes. With the Essential Nodes and the local data, we build 
local essential tree (LET) and traverse it to calculate the gravita¬ 
tional potential. The calculated potential is then ready to use by 
other FLASH modules and the gravity solver can be called in the 
subsequent simulation step. 


FLASH software (Fryxell et al. 20001. Our implementa¬ 


tion of the BH-tree algorithm is written in FORTRAN and 
C for the code running on CPU’s and in CUDA-C for the 
code running on the GPU-Device. Accessing the global 
data fields of the FLASH package and calling message¬ 
passing routines is solely done using CPU code. The tree 
walks and tree builds are carried out on the GPU-Devices 
using several CUBA kernels. 


3.1. The CPU Code 

The major part of the CPU Code covers the alloca¬ 
tion of data fields as well as accessing the FLASH routines 
for writing and reading global fields. Performance wise, 
the most important parts of the CPU code are the calls 
to message-passing routines. Generally, message-passing 
takes a large amount of time and therefore plays an im¬ 
portant role in optimization procedures. In our implemen¬ 
tation, message-passing routines are called to communi¬ 
cate pruned BH-trees and BH root node data. The naive 
strategy to calculate the forces and communicate data- 
on-demand requires a two-way communication and may 
cause a large communication overhead if not programmed 
carefully. Furthermore, a data-on-demand communication 
does not follow the aim to calculate the forces on the GPU. 
A GPU kernel would have to be stopped while waiting for 
the data and restarted after the data was received. To 
overcome this limitation, the BH-LET can be constructed 
prior to the force calculation ( Liu and Bhatt[ [2000[ ). For 
each FLASH call to calculate the gravitational potential, 
every process runs through a series of steps. Figure il¬ 
lustrates a basic flowchart of the algorithm. 

Whenever the GPU-BH tree solver is called from inside 
the FLASH4 environment, the AMR-Nodes are expanded 



Figure 3: Basic outline of the algorithm steps used in in the GPU- 
BH tree code. Notes in brackets mark whether the respective routine 
runs on the CPU or the GPU and if message-passing routines are 
called (MPI). 


and the cell data is read (fig. Collect cell data). As 
a next step, some basic information about the domain’s 
size and its orientation is distributed along all processes 
(fig-i Exchange basic data). In step three (fig. Cal¬ 
culate and exchange Essential Nodes), we build the local 
BH-trees and calculate the LET nodes for each process, 
which are delivered to the respective processes. The tree 
builds and the tree walks in this step are implemented 
solely on the GPU device. In step four (fig. Construct 
BH-LET and calculate potential), we build the final BH- 
LET on the GPU-device and calculate the gravitational 
potential. Finally, the calculated potential is written back 
to the FLASH internal solution vector (fig. Write forces 
to solution vector). 


3.1.1. Collect cell data and exchange basic data 


The FLASH4 simulation software (Fryxell et al.| 

20001 

uses the PARAMESH library (' 

MacNeice et al. 

2000 

Plewa 

et al. 

2005 

Deane et al. 

2001 

1 to handle its AMR mesh 


and the domain decomposition. Whenever the GPU-BH 
tree solver is called from inside the FLASH4 environment, 
the AMR-Nodes are expanded and the AMR cell data is 
read (fig. Collect cell data). To construct a local root 
node, the Center of Mass of the local data cells is cal¬ 
culated on the fly while parsing the AMR tree. Position 
data and the respective mass of every data cell is read and 
stored in arrays with a one-dimensional layout. For later 
usage on the GPU, the arrays are structured into blocks 
holding the data of exactly 32 neighboring cells. This does 
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not only preserve locality of the data cells but matches the 
warp size of the GPU, which is beneficial for the tree 
walks. 

A virtual root node (in the following referred to as “VN”) 
located at the center of mass of all local data cells is con¬ 
structed. Every VN holds position and mass data as well 
as its size in a double precision floating point variable. 
Because every process needs a copy of all VNs and the 
amount of data communicated in this step is compara¬ 
bly small, a single MPI_Allgather call realizes the VN ex¬ 
change]^ This way, the owner of every VN can be hgured 
out solely by the index. 


Essentials index: 


Pi 

P2 

PZ 

Pa 

P5 


Pn 


Exchange array: 


Xi 


XN 

yi 


VN 

Zl 


ZN 

mi 

... 



Figure 4: Structure of the index array and the exchange array. The 
index array holds pointers to the position and mass data of Essential 
Nodes and data cells. For N Essential Nodes, N integer values are 
copied from device.memory to host-memory. The exchange array is 
built by means of the index array holding AN entries. 


3.1.2. Calculate and exchange Essential Nodes 

As mentioned before, the overlap of communication 
time with calculation is an important concept in distributed 
memory approaches. When using the LET approach, the 
chance of a communication and computation overlap dur¬ 
ing the force evaluation is obviously lost. Since the vast 
majority of communication time is needed when the Essen¬ 
tial Nodes are exchanged, the respective calculations can 
take place during this communication procedure (fig. 
Calculate and send essential sub-trees). 

Figure]^ shows a flowchart of the procedures and how the 
overlap is implemented. Whenever a process acquires a 
GPU-device, the device is blocked for other processes and 
a local BH tree is build. The tree is traversed on the 
GPU for every imported VN to calculate the respective 
Essential Nodes. After all Essential Nodes for one VN are 
calculated; an index array is copied to host memorj0 We 
use an index array instead of the floating-point node data 
to reduce the memory operations during the main loop. 
The actual array of the node data is based on the index 
array and constructed on the CPU and is sent to the re¬ 
spective VN owner using non-blocking routines. The mes¬ 
sage size varies depending on the cell opening criterion and 
the number of data cells a processor holds. For a process 
holding n data cells, the number of Essential Nodes may 
vary between 1 (the process just sends its root node) and 
n (the process sends all data cells). Processes holding no 
data cells do not access the GPU device and do not build 
up a local tree. Figure illustrates the structure of the 
index- and the exchange-array. 

The next VN can be processed immediately, because 
non-blocking MPI routines (MPI_ISend) are used for the 
communication in this step. On the receiver’s side, the 
Essential Nodes are received while waiting for a free GPU 
device. Ideally, the waiting times to access a GPU de¬ 
vice and the actual calculation time on the device overlap 


3A warp is a set of threads scheduled at the same time on the 
CUDA device. The maximum number of threads in a warp is 32. 

^Using MPI_Allgather, the block of data sent from the jth process 
is received by every process and placed in the jth block of the receive 
buffer. 

^Host memory is maintained by the CPU in its own separate 
memory space in DRAM 


For each VN 



Figure 5: Flowchart of the steps carried out when calculating and 
distributing Essential Nodes. Blue boxes indicate procedures car¬ 
ried out on the GPU device, green boxes indicate CPU procedures. 
A serial GPU-device access is assumed for all contributing processes 
P1...PN (red circles). Process PI acquires the GPU and starts build¬ 
ing a local BH tree. Other processes {P2...PN) have to wait until the 
GPU device is freed by PI. The local BH tree is traversed for each 
imported VN (blue box (Walk tree)) and the calculated Essential 
Nodes are sent to the respective processes (green box (send Nodes)). 
In case all VNs are processed, PI frees the GPU device and P2 can 
acquire it to fulfill the same tasks. 


with the communication time. This is true for all but 
the first processes accessing the GPU device. The actual 
procedures carried out on the GPU device are outlined in 
section 13.21 


3.2. The CUDA Kernels 

The whole tree building process together with the cal¬ 
culation of the essential sub trees and the gravitational 
potentials is solely done in device memory. This approach 
minimizes memory transactions like copying data from the 
host to the device or vice verse. Only position and mass 
data is copied to the device and just results are copied 
back to host memory. The construction of the Barnes & 
Hut trees and the memory layout is based on the findings 


reported by (Burtscher and Pingali 2011). In the follow¬ 
ing, the data layout and the largest GUDA kernels are 
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Figure 6: Illustration of the device array structures. The Node-size I 
and the the distance delta is only used for internal nodes. The array 
for the gravitational potential is only used for leaf nodes (data cells). 
Position {x,y,z) and mass (m) arrays are used for both. 


described in detail. 


3.2.1. Data Layout 

We use several aligned scalar arrays and array indices 
to allow for coalesced memory access. The common fields 
(three dimensional position and mass) for cell data (leaf 
nodes) and internal nodes are represented with four floating¬ 
point arrays. The leaves are allocated at the beginning and 
the internal nodes at the end of the arrays. Other fields are 
only valid for leaf nodes or internal nodes. For the latter, 
we use one array to store the node size (/) and one array 
to store the distance {5) between the node’s center of mass 
and its geometric center. The node size I is calculated and 
stored during the tree build (see section; 3.2.2) and the dis¬ 
tance 6 is evaluated while calculating the centers of mass 
(see section: |3. 2.3). Finally, the array for the gravitational 


potential is only used for the leaf nodes. Figure [^outlines 
the described memory structure. 


3.2.2. Tree Building Kernel 

The Tree Building Kernel implements an iterative tree¬ 
building algorithm. Starting with a root node, all local 
data cells and all Essential Nodes are inserted into the tree. 
In case of a local BH-tree, as it is used for determining 
Essential Nodes, the root node’s position data exactly at 
the geometric center of all local data cells. In case of a 
LET, the root node is located at the domain’s center. For 
the LET it is important that all processes agree on the root 
node level, because Essential Nodes are treated as data 
cells during the tree build. In the Tree Building Kernel, 
data cells are assigned to blocks and threads within blocks 


in a round robin fashiorj^ Every thread traverses the tree 
down to the desired leaf node and inserts an index pointing 
to its cell into the free leaf node. For this, the respective 
leaf node is locked such that other threads attempting to 
access this node have to wait until the appropriate index 
is inserted. 

3.2.3. Centers of Gravitation Kernel 

In the Centers of Gravitation Kernel the masses of the 
internal nodes are calculated, their three-dimensional po¬ 
sition is updated to the center of gravity and the distance 
between the node’s center of gravity and its geometric cen¬ 
ter is evaluated. During the tree build all internal nodes 
are initiated with a negative mass indicating that their 
actual mass needs to be calculated. Since mass data is 
only available for the leaf nodes (the nodes containing cell 
data), the masses are summed up in a bottom-up manner. 
The tree is walked from bottom up where internal nodes 
are assigned to blocks and threads within blocks. Here, 
every thread processes one node and its direct children. 
For a node N with n children Ci, i = l...n each with mass 
mi and position data with coordinates ri,i = 1, ..,n. The 
mass of the node M is simply the sum of the children’s 
masses mi,.., m^ and its position is at R with: 

1 " 
i=l 

A thread with its assigned internal node at the very bot¬ 
tom of the tree simply sums up the masses and updates its 
node mass. The geometric center and its distance to R is 
calculated and stored in a separate array before the node’s 
mass and position is updated. Threads that encounter a 
child with negative mass wait until its mass is updated by 
another thread. 

3 . 2 . 4 . Find Essential Nodes Kernel 

To generate the essential sub trees, the local BH-trees 
must be traversed with regard to the VNs. Nodes and data 
cells are treated as essential if they pass the opening angel 
test. We use a slightly modified version of the opening 
angle criterion described in Eq. The imported VNs are 
treated as nodes during the tree walk, and the distance 
between a VN and an internal BH node must be evalu¬ 
ated. Here, we refer to the minimum distance dc between 
the nodes center of mass and the VN (see fig. . 

In general, the selection of Essential Nodes can be done 
using any tree traversal method. A simple algorithm tar¬ 
geted to GPU devices would be to transform a depth- 
first or breath-first traversal with each VN assigned to one 
GPU-Thread. To support the earlier described method of 


®CUDA blocks are organized into a one-dimensional, two- 
dimensional, or three-dimensional grid of thread blocks. On current 
GPUs, a thread block may contain up to 1024 threads (INVIDIAI 

[2M^. 
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divergence and can potentially increases the accuracy, the 
beneficial effect of the opening angle test is obvious lost 
for some threads. In a worst-case scenario every warp tra¬ 
verses the whole tree because the warps data cells do not 
share the same test outcomes. The number of diverging 
test outcomes can be reduced if warps process only neigh¬ 
boring data cells, because nearby data cells share the same 
interaction list (iBarnes 19901. For this, we store the data 


cells in blocks of exactly 32 (warp size) neighbors such that 
every warp reads a set of nearby data cells. 


Figure 7: The geometry of the modified opening criterion used in 
the GPU-BH tree code when calculating the Essential Nodes. 


4. Results 


overlapping and to minimize the memory footprint, the se¬ 
lection of Essential Nodes is implemented for single VNs. 
Opposed to assigning the VNs to threads, the local tree 
nodes are assigned to blocks and threads within blocks in 
a round robin fashion starting from the root node. The 
Find Essential BH-Nodes Kernel generates an index list 
holding indices for all nodes and data cells treated as es¬ 
sential for the processed VN. The CUBA kernel is called 
once for each depth and thus block level synchronization 
is assured. Neither atomic operations nor direct synchro¬ 
nization calls are used. The generated index list is copied 
back to host memory, where the respective communica¬ 
tion buffer is filled with the position and mass data of the 
Essential Nodes and data cells. 

3.3. Calculate Gravitational Potential Kernel 

Given a tree with updated centers of gravity and masses, 
the gravitational potential is calculated for all local data 
cells. The corresponding CUDA-Kernel performs a stack 
based depth-first tree traversal. The stack is created in 
shared memory to guarantee fast memory access. The 
stack size is aligned to the warp size and only the first 
thread of a warp (lane 0) is allowed to modify the stack. 
Local data cells are assigned to blocks and threads within 
blocks in a round robin fashion. The first thread of the 
warp traverses the tree and pushes the nodes onto the 
stack. Every thread reads the top node on the stack and 
calculates the distance between its data cell and the node 
in the stack. The opening angle parameter test according 
to Eq. is applied and the gravitational potential for the 
data cell is updated in cache. Note that the opening angle 
test works as a data-dependent conditional branch where 
threads in a warp could diverge. This is a potential perfor¬ 
mance bottleneck, because the warp would serially execute 
each taken branch. To overcome this performance lack, the 
thread voting function all()[^is used for the outcome of 
the test. As a result, every thread in the warp “opens” 
a node even if just one of the threads in the warp has a 
respective test outcome. Although this prevents thread 


In this section, we describe the results of three astro- 
physical motivated test problems solved using our GPU- 
BH tree algorithm: the gravitational potential of a ho¬ 
mogeneous spheroid, a homogeneous spheroid with initial 
velocity turbulence and a homogeneous oblate spheroid. 


All simulations were carried out in FLASH4.2.2 (Fryxell 


et al. 20001. 


The simulations were carried out on our institute’s cluster 
where every compute node holds two Intel Xeon Processors 
with each having six cores with a clock speed of 2.4 GHz 
and one NVIDIA Tesla G2075 GPU device with 6 GB of 
GDDR5 memory and 448 CUBA cores with a clock speed 
of 1.15 GHz. The GPU-Gode was compiled using the Intel 
compilers (IGG v. 10.1, IFORT v. 10.1) with aggressive 
optimization flags (-03). The GPU code was compiled us¬ 
ing the NVGG compiler (Guda compilation tools, release 
7.0, V7.0.27) with default settings. The runtime of the 
simulations were measured using different timer functions 
provided by the FLASH4 API. 


^The allO warp voting function evaluates a predicate for all 
active threads of the warp and returns non-ze ro if and only i f the 
predicate evaluates to non-zero for all of them INVIDIA 2015 i. 
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Figure 8: Potential of a MacLaurin spheroid with eccentricity 0.9 as computed with the GPU-BH tree code using a = 0.25 and using an 
adaptively refined mesh with seven levels of refinement. The left image shows the potential and AMR block outlines (grey and black lines) 
in the x-y plane passing through the center of the ellipsoid. The right image shows the same quantities for the y-z plane. 


4-1. Accuracy with the MacLaurin Sphere 

To test the accuracy and the influence of the open¬ 
ing angle parameter a, we use a so called “MacLaurin” 
spheroid. The gravitational potential at the surface of, 
and inside such a homogeneous spheroid is expressible in 


terms of analytical functions (Chandrasekhar 19691. 


For a point inside the spheroid with density p the gravita¬ 
tional potential is 


<i)(a;) = 7rGp[2Aiai — Ai{x^ -\- y^) — ^3(03 — 


( 6 ) 


where oi, 02 and 03 are the semi major axes of the spheroid 
and oi =02 > 03 . Here 


•\/l — e 


Ai = 


^.3 = ^ — 


2 . _i l-e2 
sm e- 3 —. 




sin 


-1 


where e is the ellipticity of the spheroid: 




(7) 

( 8 ) 

(9) 


For a point outside the spheroid, the potential is: 

oietan”^ ^ ~ ^ ^ ~ ^jpi 


+ 2z^{h — tan ^ h) 


( 10 ) 


where 


oie 

i/og -I- A 


and A is the positive root of the equation 


( 11 ) 


- 1 ( 12 ) 
a? + A ^ oi+A ^ oi + A ^ ’ 

The simulation was setup with a uniform density p = 
1 g cm“^ inside the spheroid and p = e —>■ 0 outside the 
spheroid with an eccentricity of 0.9. The spheroids were 
centered in a box with unit dimensions. An adaptively 
refined mesh with seven levels of refinement (« 7.2m data 
cells) was used. Using our GPU-BH tree code and the 
CPU-BH tree code within FLASH4, we computed poten¬ 
tials with varying a values (a = 1.0, 0.75, 0.5, 0.25, 0.1). 
An example of the potential for a = 0.25 computed with 
the GPU-BH tree code using a maximum refinement level 
7, is shown in fig. All simulations were run using 12 
CPU-Cores with two GPU devices utilized by the GPU- 
BH tree code. 


We compare the analytical solution for the gravitational 
potential (/^MacLaurin to the potential calculated with the 
GPU-BH tree code (pcpu and the CPU-BH tree code (/>cpu- 
For the tests, we evaluate the relative error i^err with: 


^MacLaurin 0<algorithm> 


^MacLaurin 


( 13 ) 
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Figure 9: Maximum relative errors ((^err) in the gravitational poten¬ 
tial for the MacLaurin sphere calculated with the GPU-BH tree code 
( \ and the CPU-BH tree code | | M [ l for different a values. The 

GPU-tree code produces a higher accuracy for q values between 0.2 
and 0.8 while the CPU tree code shows an accuracy advantage for a 
values > 0.8. Here, an alpha value of 0.0 refers to the direct O(n^) 
algorithm. 


For comparison, we calculated the potential with the di- 
rect O(n^) method where a relative error of 2.17 x 10“'*’ 
was calculated. Typically, the Barnes & Hut algorithm 
achieves higher accuracy with lower a values ([Barnes and 


Hut , 1986 Hernquist 1987| ). Figure clearly shows that 


this is true for the GPU-BH tree code and for the GPU-BH 
tree code. Both, the GPU-BH tree code and the GPU-BH 
tree code produced relative errors below 1%. For a values 
near 0.2 both solvers nearly reached the accuracy of the di¬ 
rect summation O(n^) method. For a values between 0.25 
and 0.8 the GPU-BH tree code calculated a more accurate 
potential than the CPU-tree code. For values above 0.8 
the CPU-BH tree achieved a higher accuracy. The differ¬ 
ences between the GPU-BH and the CPU-BH accuracies 
are a direct result of the different strategies when applying 
the opening angle test. 


4-2. Scaling of the GPU-BH tree code with the top-hat 
sphere setup 

In this section, we summarize the results of different 
performance tests based on the collapse of spherical cloud 
cores with and without initial velocity fluctuations. The 
setup parameters of our different simulations can be found 
in tables [HandO 


4-2.1. Simulation setup 

We perform the scaling test with a top-hat (TH) sphere 
setup with the following parameters (see also table [I]). We 
use a mean molecular weight oi fi = 2.3 g/mol with an 
isothermal equation of state (EoS) at a temperature of 


20 K. The density profile of the system with a sphere 
radius of Rq is described by a step function 


P = 


ip) 

O.Ol(p) 


for r<Ro 
for r > Rq. 


(14) 


With the density (p) = 1.76 x 10 g cm ^ leading to a 
free fall time of 


tff = 


Stt 


32G(p) 


= 1.58 X lO^^s « 50 kyr. (15) 


The initial sphere is highly gravitationally unstable with 
the Jeans length 





= 5736 au = 0.28 


Rq. 


(16) 


We use the top-hat sphere with varying parameters for our 
performance tests. To investigate the performance influ¬ 
ence of varying GPU/CPU ratios, we use different sphere 
radii (i?o) refinement levels. One simulation setup 
with a sphere radius of i?o = 3.5 x 10*^^cm and Zmax = 6 
resulting in « 7 x 10® data cells (SP-RSS) and one with 
a radius of Rq = 1.73 x lO^^cm and /max = 7 resulting in 
« 20 X 10® data cells (SP-RSL). 

For the weak scaling test simulations (SP-WS), we scale 
the problem size to the number of processing units by mod¬ 
ifying the sphere radius Ro- Here, we use maximum refine¬ 
ment level of /max = 8 ^ Aa; = 7.8125 x 10*^^ cm. We also 
performed collapse simulations with the turbulent TH se¬ 
tups (SP-TL) for a runtime of exactly one week. Here, we 
initialize the TH-profile with an additional random super¬ 
sonic velocity field with a Mach number of Ma = 2.0. The 
average turbulence crossing time is ttc(-/?o) = 1.8 x 10® yrs 
which is about 3.5 times larger than the free fall time 
tff. An overview of the physical parameters is given in 
table [2 In these simulations, we set the maximum re¬ 
finement level to /max = 15 which results in a maximum 
resolution of 131072 grid cells in one direction, correspond¬ 
ing to « Ax « 0.4 au. For the collapse simulations, we 


applied the Truelove criterion (Truelove et al. 
resolve the Jeans length 


19971 to 


Aj = 


Gpn 


(17) 


throughout the simulation. Furthermore, we used the sink 
particle module with the Federrath criterion for sink cre- 


accretion radius of r^ 
density pmax of 


ation (Federrath et al. 2010). The sink particles have an 


= 3Aa; which leads to the threshold 


Pmax — 


TTC" 


4G {ZAxf 


= 2.518 X 10““ g cm“® 


(18) 


For the strong scaling tests, we use the SP-TL sim¬ 
ulation setup with a resolution of 12 grid cells per jeans 
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Table 1: Simulation parameters for the turbulent top-hat sphere 


Parameter 


Value 

Simulation box size 

-^^box 

8 x 10 ^^ cm 

Smallest cell size 

Ax 

6.1 X 10^^ cm 

Max. refinement 

^max 

15 

Min. refinement 

^min 

4 

Sink particle accr. radius 

^ accr 

1.83 X 10^® cm 

Max. density 

Pmax 

2.52 X 10"“ g cm"® 

Opening angle parameter 

a 

0.5 

Sphere radius 

Ro 

3 X 10®^ cm 

Total sphere mass 

Mtot 

100 Mo 

Mean density 

ip) 

1.76 X 10"^® g cm"® 

Max. gas density 

Pmax 

9.67 X 10"®® g cm"® 

Sound speed 

Ca 

0.27 X 10^ km s"® 

Mean free fall time 

iff 

1.58 X 10®® s « 50 kyr 

Turbulent crossing time 

itc 

5.5 X 10®® s 

Jeans Mass 

Mj 

1.23 Mo 

Jeans length 


1.4 X 10®®^cm 


Physical and numerical simulation parameters for the turbulent top- 
hat sphere simulations. 


length. We run the simulation with our accelerated grav¬ 
ity solver until two collapse regions including several sink 
particles are formed (tg = 1.2 x 10^^ s « 3.8 x 10'^ yrs). 
The corresponding checkpoint file is then used as an initial 
structure for the strong scaling test simulations (SP-TRS) 
with our accelerated solver, the CPU-BH tree solver and 
the Gridsolver. 


4-2.2. Scaling with different GPU/CPU ratios 

In general, FLASH 4 (Fryxell et al. 20001 simulations 
are setup for distributed-memory machines where several 
MPI (Gabriel et al. 20041 processes are created. Since 
every process holds its own CUBA context and kernels are 
effectively serialized as they launch on the GPU, several 
MPI ranks compete for the device and slow down the code 
execution. Figure 10 shows an example sequence of the 
access order for 4 processes trying to access the same GPU 
device (4 MPI Ranks per compute node). 

In order to compare the runtimes of the GPU-BH tree 
code at different CPU/GPU ratios with the runtime of 
the CPU-BH code, we calculate the respective speedup S 
with: 

Tcpu-bh 


S = 


Tgpu-bh 


(19) 


where Tcpu-bh is the time for one gravity step evaluated 
for the CPU-BH tree code and Tqpu-bh is the time for one 
gravity step evaluated for the GPU-BH tree code. Inde¬ 
pendent from the problem size (number of data cells), the 
highest speedup is expected for a one to one GPU/CPU 
ratio. We analyzed the performance depending on differ¬ 
ent GPU/CPU ratios with the SP-RSS run using « 7 x 10® 


Time 



Rank 1 
Rank 2 
Rank 3 
Rank 4 



GPU 1 


Figure 10: Illustration of the serial access sequence for processes 
competing for the same GPU device. Direct access to the GPU 
device is only granted for the first process in the queue (Rank 1). 
All other processes wait for the GPU device to be available again. 



Figure 11: The initial density and AMR block distribution for the 
SP-RSL setup. The image shows a slice through the box along the 
X-axis. An adaptively refined mesh with seven levels of refinement 
was used with a sphere radius of 1.73 x 10^^ resulting in 20 x 10® 
data cells. 


cells and with the SP-RSL run with « 20 x 10® cells (see 

fig-[n|)- 

Figure [T^ shows the speedup calculated with Eq. [T9| for 


GPU/CPU ratios of 1/12, 1/8, 1/6, 1/4, 1/2 and 1/1 while 
maintaining a constant number of 24 cores for each ra¬ 
tio. Here, we reach a maximum speedup of « 45 with 
the one-to-one GPU/GPU ratio (24 GPU devices and 24 
CPU cores) for the simulation with « 20 x 10® data cells 
(fig. 12 —Similarly, the lowest speedup factor of « 5 


was evaluated for the 1/12 ratio (2 GPU devices and 24 
CPU cores) caused by the long GPU device access times 
following the serialized access pattern illustrated in fig. 

One way to circumvent this performance bottleneck is 
to collect all data on one of the CPU cores sharing one de¬ 
vice (fig. ElBH Although intra node communication 


®This is only practical for small problem sizes because the GPU- 
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Table 2: Scaling of the different top-hat sphere simulations 


Name 

Solver 

Steps 

Ro [cm] 

^max 

Aa: [cm] 

Resolution [cells] 

Cell count 

Figure 

SP-RSS 

CPU-BH 

10 

3.50 X 10^^ 

6 

3.12 X 10^® 

256® 

7 X 10® 

12 





3.50 X 10^^ 


3.12 X 10^® 






GPU-BH 

10 

6 

256® 

7 X 10® 

12 




SP-RSL 

GPU-BH 

10 

1.73 X 10^^ 

7 

1.56 X 10^® 

512® 

20 X 10® 

TT 


T5 

R’Fl 


GPU-BH 

10 

4.50 X 10^® 

8 

7.81 X 10^'‘ 

1024® 

2.73 X 10® 

15 


16 

IHI 




3.60 X 10^^ 




1.48 X 10® 




SP-WS 

CPU-BH 

10 

4.50 X 10^® 

8 

7.81 X 10^'‘ 

1024® 

2.73 X 10® 

15 






2.61 X 10^^ 




7.42 X 10'^ 





SP-TL 

CPU-BH 

GPU-BH 

2006 

4465 

3.0 X 10®^ 

3.0 X 10®'^ 

15 

15 

6.10 X 10®® 

6.10 X 10®® 

131072® 

131072® 

3.5 X 10'® 

4.4 X 10'® 

18 

TS 


19 

t5 


50 

50 


GPU-BH 

2 

3.0 X 10®^ 

15 

6.10 X 10®® 

131072® 

1.85 X 10® 

22 


23 

57 


24 

5S 

[2^ 

SP-TS 

CPU-BH 

2 

3.0 X 10®'^ 

15 

6.10 X 10®® 

131072® 

1.85 X 10® 

55 

27 


5S 

28 


5S 

m 


Gridsolver 

2 

3.0 X 10®^ 

15 

6.10 X 10®® 

131072® 

1.85 X 10® 

55 

57 


5S 

5S 



Top-hat sphere simulations used for our scaling tests. The cell count for the SP-TL setups refer to the final stage of the simulations. The 
radii for the the SP-WS simulations are used to scale the cell count for the weak scaling test simulations (see table for a list of radii and 
their corresponding cell count). 


is comparably fast and the number of required tree builds 
is reduced, the evaluated speedup factors for this approach 
are generally lower compared to the previously described 
approach. 

Nevertheless, we achieved a nearly linear scaling effi¬ 
ciency considering the count of GPU-devices. In Fig. 
we show the strong scaling efficiency for different GPU to 
GPU ratios. Starting with a time U for one gravity step 
at a GPU/GPU ratio of 1/12, the percentual efficiency at 
a specific ratio with the respective time t-c is given as: 

* 100 ( 20 ) 


The results are shown for the two simulations SP-RSS and 
SP-RSL. Note, that the base ratio of 1 GPU device for 12 
GPU cores (ti) is achieved using 2 compute nodes with a 
total of 24 cores and 2 GPU devices and already includes 
communication times for inter and intra node communica¬ 
tion. (The presented speedup and scaling values refer to 
the gravity solver module only.) Although we measured a 
speedup factor of « 40 compared to the original CPU-BH 
tree code (see fig. 12 —the entire wall time for the 


simulation did not show the same speedup since our grav¬ 
ity solver only makes up less than half of the entire wall 
time for the related GPU/GPU ratio. 


Figure shows the percentual fractions of our GPU 
accelerated gravity solver and the hydro solver for differ¬ 
ent GPU/GPU ratios. The presented values are evaluated 


device memory is comparably small and limited. 



« 20 X 10® cells « 7 X 10® cells 
« 7 X 10® cells (G) 


Figure 12: Speedup factors for the gravity module using different 
GPU/CPU ratios for the GPU-BH tree code in comparison to the 
GPU-BH tree code. The Y-axis refers to the speedup of the GPU-BH 
tree code in comparison to the GPU-BH tree code and the X-axis 
refers to the GPU count for different GPU/GPU ratios using a total 
of 24 CPU cores. We see the highest speedup with the SP-RSL 
setup ( | • | |. With the smaller simulation setup SP-RSS j M [ ), 
the speedup is slightly lower especially when using less CPU cores 
per GPU. The speedup factor s with the SP-RSS setup-using the 
gathered data approach ( | ^ [ | show the lowest speedup compared 
to the other setups. The highest speedup of Pi 45 was gained for a 
1/1 GPU/CPU ratio (24 GPU devices and 24 CPU cor es) wit h the 
SP-RSL simulation setup holding Pi 20 x 10® data cells j • [ |. 
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2 3 4 6 12 

GPU [#] 


linear —« 20 x 10® cells 
—■— « 7 X 10® cells 


Figure 13: The str ong scaling efficiency of our GPU-BH tree solver 
calculated with Eg. |20| for varying CPU/GPU ratios and a constant 
number of CPU cores. The X-axis refers to the GPU count for dif¬ 
ferent GPU/CPU ratios using a total of 24 CPU cores. We see a 
nearly linear strong scaling for the SP-RSS simulation | | M [ l and 
the SP-RSL simulation | | • [ l. 



GPU [#] 


0 Gravity D Hydro 


Figure 14: Runtime fractions of the ppm hydro solver ( Q 
our accelerated gravity solver ( |n| ) for different CPU/GPU ratios. 
The X-axis refers to the GPU count for different GPU/CPU ratios 
using a total of 24 CPU cores. Values are evaluated for the SP-RSL 
simulation with Pi 20 x 10® data cells. With a lower runtime of 
the gravity solver for increasing GPU/CPU ratios, the hydro solver 
becomes the dominating solver for the simulations. For a CPU/GPU 
ratio of 6/1 (4 GPU devices and 24 CPU cores), already 55% of the 
evolution time was spent in the hydro routines. 


for the (SP-RSL) simulation with « 20 x 10® data cellsj^ 
At a ratio of 1/12, we find the runtime share of our grav¬ 
ity solver is the dominating part with « 65% of the total 
wall time{^ Since our accelerated gravity solver benefits 
from a lower GPU/GPU ratio, its contribution to the en¬ 
tire simulation time is reduced with lower ratios. Hence, 
we measure lower runtime shares with lower ratios. Al¬ 
ready for a ratio of 1/6, our solver contributes only « 45% 
to the wall time. Finally, we measure only a « 23% con¬ 
tribution for a ratio of 1/1 and find the hydro solver to be 
the dominating code part (see fig. 14). 


4-2.3. Weak scaling with the top-hat sphere 

For the weak scaling, the problem size (number of data 
cells) assigned to each processing unit stays constant. We 
refer to a processing unit, “Py” as a unit of six CPU cores 
plus one GPU device^] We use the weak scaling efficiency 
given by: 

where ti is the wall time of one gravity time step with one 
P\] and In is the time of one gravity step with N process¬ 
ing elements. We use the SP-WS top-hat sphere setup and 


^Values evaluated for the smaller SP-RSS simulation with ^ 7 X 
10® data cells only differ for ^ 2% to the presented values. 

^®For the same simulation setup the CPU-BH gravity solver dom¬ 
inated the runtime with ^ 88%. 

^^We choose six cores to fully load exactly one of the hexacore 
CPUs on each compute node and use the MPI call to control the 
respective processor binding affinity. 


scale the problem size to the number of Pu by changing 
the sphere radius i?o as a result, the overall cell count. 
Table gives an overview of the different sphere radii and 
cell counts used for the weak scaling tests. (The scaling 
results are normalized to match 1 Leaf-node per core.) 

In fig. 15 we show the weak scaling efficiency (Eg. [2l]) for 
three different simulation sizes with the SP-WS setup hold¬ 
ing « 8.68x10® ( |->-| ), 6.72x10® d^l ) and 4.4x10® ( |-^| ) 
data cells assigned to each GPU core. We see a drop of the 
weak scaling efficiency with increasing number of process¬ 
ing elements. The main reason for this loss in efficiency 
is not only based on increased communication times also 
the algorithm design is a factor. Especially the method 
of calculating the essential nodes plays a role, since the 
local tree is traversed once for every process. Although 
this is beneficial to overlap computation with communi¬ 
cation time, the loop size increases with higher numbers 
of processing elements. As a result, more data is copied 
from device memory to host memory and more GPU ker¬ 
nels are started. The fractional runtimes of the respective 
memory operations and kernel executions are outlined in 
fig. With 16 Pjj, the memory operations and kernel 
executions during the calculation of the Essential Nodes 
make up « 10% of the total GPU time which is domi¬ 
nated by kernel executions. Note, that the actual runtime 
per kernel does not change but the kernel is started more 
often. The large number of kernel calls and memory oper¬ 
ations is clearly reflected in the runtime of the the routines 
for calculating the Essential Nodes. Figure [TT] illustrates 
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Table 3: AMR block count with different sphere radii 


Ro 

Blocks 

Leaf-blocks/Core 

cells 

Pv 

SPWS with R 

i 4.4 X 10® data cells/cores 


4.50 X 10^® 

6089 

« 888 

2.73 X 10® 

1 

9.70 X 10^® 

24457 

« 891 

1.10 X 10^ 

4 

1.36 X lO^’’ 

44937 

Ri 819 

2.01 X 10^ 

8 

1.70 X 10^'^ 

71945 

Ri 874 

3.22 X 10^ 

12 

1.98 X lO^’’ 

94985 

Ri 865 

4.25 X lO’’ 

16 

2.20 X lO^’’ 

118537 

Ri 864 

5.31 X lO’’ 

20 

2.40 X lO^’’ 

141257 

Ri 858 

6.32 X 10^ 

24 

2.61 X lO^’’ 

165705 

Ri 863 

7.42 X lO’’ 

28 

SPWS with « 

6.72 X 10® data cells/cores 


5.70 X 10^® 

8905 

Ri 1298 

3.99 X 10® 

1 

1.19 X lO^’’ 

36617 

Ri 1335 

1.64 X 10^ 

4 

1.70 X 10^'^ 

71945 

Ri 1311 

3.22 X 10^ 

8 

2.08 X 10^'^ 

105993 

Ri 1288 

4.75 X 10^ 

12 

2.43 X lO^’’ 

144905 

Ri 1320 

6.49 X 10^ 

16 

2.70 X 10^'^ 

178889 

Ri 1320 

8.01 X 10^ 

20 

2.98 X lO^’’ 

217033 

Ri 1318 

9.72 X lO’’ 

24 

3.23 X lO^’’ 

253577 

Ri 1320 

1.14 X 10® 

28 

SPWS with « 

8.68 X 10® data cells/cores 


6.50 X 10^® 

11721 

Ri 1709 

5.25 X 10® 

1 

1.36 X lO^’’ 

44937 

Ri 1638 

2.01 X 10^ 

4 

1.98 X 10^'^ 

94985 

Ri 1731 

4.25 X 10^ 

8 

2.40 X lO^’’ 

141257 

Ri 1716 

6.32 X 10^ 

12 

2.75 X 10^'^ 

183497 

Ri 1672 

8.22 X 10^ 

16 

3.10 X 10^'^ 

232905 

Ri 1698 

5.25 X 10® 

20 

3.40 X lO^’’ 

277193 

Ri 1684 

1.24 X 10® 

24 

3.60 X 10^'^ 

330377 

Ri 1720 

1.48 X 10® 

28 


The AMR block count for different sphere radii used to scale the 


top-hat sphere simulations SPWScpij and SPWSgpv - 






« 8.68 X 10® cells/cores 
« 6.72 X 10® cells/cores 
« 4.4 X 10® cells/cores 
4.4 X 10® cells/cores (CPU) 


Figure 15: Weak scaling efficiency in % of linear for the top-hat 
simulations with the SP-WS setup (gravity only). The scaling values 
for the smaller simulations with 4.4 x 10^ data cells/cores run with 
our accelerated solver ( | M [ | and with the CPU-BH solver p^#— [ show 
a similar scaling trend for more then 15 Pij. We achieved the best 
scaling results for ou r solve r with the larger system holding 5^ 8.68 x 
10^ data cells/cores ( | • | l. 
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Figure 16: Weak scaling of the runtime fractions of GPU device 
memory operations and GPU kernel executions. The blue and the 
yellow bars refer to the runtimes of the memory operations and the 
kernel executions during the calculation of the Essential Nodes. The 
red and the grey bars show the proportions for the code calculating 
the final potential. The values are calculated for the SP-WS simu¬ 
lation with P:: 8.68 X 10^ data cells. One Pv refers to 6 CPU cores 
and 1 GPU device. 


the growing runtime contribution of the respective rou- 
tines with the larger top-hat setup holding « 1 x 10® data 
cells per core. We see that the routines for calculating 
the Essential Nodes take a maximum of « 15% of the to¬ 
tal solver runtime with 28 Pjj. Nevertheless, at the same 
count nearly « 80% of the linear scaling efficiency was 
reached with the small system and « 88% was reached 
with the larger system. For the smaller setup, the CPU- 
BH and the GPU-BH simulations show nearly the same 
weak scaling efficiency for Py > 15. 

4-2.4- Comparison of GPU-BH and CPU-BH for fixed run¬ 
time (7 days) 

We run two simulations of the top-hat sphere setup 
with turbulence and sink particles (SP-TL). One simula¬ 
tion with our GPU accelerated gravity solver and one with 
the CPU-BH Tree solver. While the CPU only simulation 
was run with 72 CPU cores, we used 60 CPU cores and 10 
GPU-devices for the GPU accelerated simulation. Both 
simulations were run for = 168 h = 7 days. With the 
CPU-BH simulation, we reached a total of 2006 evolution 
steps and a final simulation time of C ~ 1-0 x 10^ yrs. 
Since the simulation stopped at an early stage of the col¬ 
lapse with a maximum density of 9.12 x 10^^ g cm“®, nei¬ 
ther the maximum refinement fevef was reached nor any 
sink particfes were formed. 

With the GPU acceferated simufation, 4465 evofution 
steps were executed and a simulation time of tg ~ 1.6 x 
10^ yrs was reached with a maximum density of 3.73 x 
10^® g cm“®. Similar to the CPU only simulation, the 


Figure 17: Weak scaling runtime fractions of different parts of the 
GPU-BH tree gravity solver for the SP-WS simulation with Ri 8.68 x 
10® data cells per CPU core. The runtime fraction of the routines 
for calculating the Essential Nodes grows with increasing number of 
Pu- One Ptj refers to 6 CPU cores and 1 GPU device. 


maximum refinement level was not reached and no sink 
particles were formed. Note, that the GPU-BH tree simu¬ 
lation reached the 2006 step mark already after « 2.6 days. 
An overview of the simulation parameters for the final 
stages is given in table For comparison, we show slices 
of the density distributions of both simulations at the same 
simulation time (C = 10.2 kyr) in fig. 18 A simple com- 


Table 4: Final simulation parameters 


Parameter 

CPU-BH 

GPU-BH 

Max. refinement 

9 

io 

Leaf-blocks total 

69273 

85261 

Leaf-blocks/CPU 

« 962 

« 1421 

Max. gas density 

9.12 X 10^'^ g cm"® 

3.73 X 10“ g cm"® 

Evolution steps 

2006 

4465 

Max. sim. time 

3.2185 X 10“ s 

5.2345 X 10“ s 


Simulation parameters in the final stage (after 168 compute 
hours, i.e. 7 days) of the GPU accelerated simulation and the 
CPU only simulation. 


parison between the number of executed evolution steps 
suggests a speedup factor for the GPU accelerated sim¬ 
ulation of « 2.2. At ts = 10.2 kyr where both simula¬ 
tions hold approximately equal numbers of leaf blocks, 
we find a total simulation speedup factor for the GPU- 
accelerated simulation of « 2. Note that this simu¬ 
lations ran with a 1/6 GPU/GPU ratio, nevertheless, we 
expect a much larger speedup for larger GPU/GPU ratios. 


®^These estimations do not take into account the workload differ- 
ence caused by different CPU counts. 
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(a) GPU-accelerated at tgim = 1-02 x 10“^ yrs, twall = 62 h (b) CPU only at £sim = 1*02 x 10^ yrs, twall = 168 h 

Figure 18: Slices of the density for the SP-TL simulation at tgim = 1-62 x 10^ yrs. The left plot shows the GPU accelerated simulation while 
the right plot shows the CPU only simulation. The respective simulation step was reached after 7 days with the CPU only solver and after 
^ 2.6 days with the GPU accelerated solver. The CPU simulation was executed with 72 CPU cores and the accelerated GPU simulation was 
executed with 60 cores and 10 GPU devices. 


Furthermore, in fig. we show the runtime fractions of 
different modules for both SP-TL simulations. Following 
our expectations from the previous simulations (141, we 
find the runtime fractions of our gravity solver and the 
hydro solver to be roughl y the same with 4% differ- 

GPU jP □|[]d| ), GPU-FIN( pa|D4 )• With the 
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ence (fig. 

GPU only SP-TL simulation, the GPU-BH gravity solver 
is by far the dominating module with over 80% wall-time 
contribution. In the GPU accelerated SP-TL simulation, 
the number of AMR blocks increases with the simulation 
time and as a result, the workload for our GPU-BH tree 
solver grows. The runtime fractions of the major routines 
executed during the evolution of the accelerated SP-TL 
simulation are shown in fig. |20[ We see a growing run¬ 
time contribution of our gravity solver during the first 
1000 steps of « 8% from the initial 27% to 35%. After 
1000 steps, the runtime contribution slowly converges to 
« 36%. Furthermore, we show the fractional contribu¬ 
tions of the tree-algorithm in fig. Here we see that the 
fractional increase of the gravity solver (see fig. 20) comes 
mainly from the additional communication time with the 
growing number of data cells. The communication dom¬ 
inates the runtime during the early steps {N < 500) due 
to small number of grid cells. 


4-2.5. Strong scaling with the turbulent top-hat sphere 
We evaluate the strong scaling capabilities of our GPU- 
BH solver with the turbulent top-hat sphere setup. For 
this, we run the SP-TL simulation until tg = 1.2 x 10^^ sec R 
3.8 X lO'^ yr. At his point, two collapse regions and 26 sink 
particles with a total mass « 0.42 Mq are formed. We 



D D Grid D D Particles D D Hydro D D Gravity 


Figure 19: Runtime fractions of different code parts for the CPU only 
(CPU) and GPU accelerated (GPU, GPU-FIN) SP-TL (CPU) sim¬ 
ulations. The values for CPU and for GPU-FIN refer to the whole 
simulation time of the CPU only and the GPU accelerated simu¬ 
lation. The values for GPU refer to the GPU accelerated SP-TL 
simulation at a state near the CPU simulatio’s final state. For the 
CPU only simulation, the gravity solver dominated the runtime with 
more than 80%, while for the GPU accelerated simulations, the hy¬ 
dro solver is the dominating module. (The GPU accelerated SP-TL 
simulation was run with 10 GPU devices and 60 CPU cores.) 
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Figure 20: Runtime fractions of the major modules used in the GPU 
accelerated SP-TL simulation. The fractional time of the accelerated 
gravity sol ver (f— M —\ grows as a result of the increasing number of 
data cells 1 |- - - [ I. 
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Figure 21: Average runtime fractions of selected routines of the 
GPU-BH tree solver during the evolution of the SP-TL simulation. 
The routines solely executed on the CPU (gathering data, alloca¬ 
tion, deallocation, ...) stay at a constant fraction ( |— •— [l, while the 
fraction of the non overlapped communication time j— »— [ | is highly 
affected by the increasing cell count ( |- - - | |. 



Figure 22: The SP-TL simulation at tg = 1.2 x 10^^ s ps 3.8x 10'^ yrs. 
Plots show the column density g cm“^ for the lower collapse region 
at different resolutions. From left to right, the box length is reduced 
from 8.0 x 10^^ cm over 8.0 x 10^® cm to 3.0 x 10^® cm. Sink particles 
appear as white dots in the large image to the right. 


used a resolution of 12 data cells per Jeans length to scale 
the simulation to hold 1.85 x 10^ cells. Figure 22 and 


23 show the two collapse regions at different resolutions. 


We use the final checkpoint file of this simulation as a 
starting point for the strong scaling tests (SP-TS). 

For the strong scaling, the problem size stays constant and 
we refer to the strong scaling efficiency (as a percentage of 
linear) with 

(lAo * 

where G is the time spent in the gravity unit with one pro¬ 
cessing element, N is the number of processing elements 
and tjv is the time spent in the gravity unit with N pro¬ 
cessing elements. Again, we use six CPU cores and one 
GPU device as a single processing unit (Pu)- 
Figure shows plots of the strong scaling efficiency of 
different code parts of our gravity solver and the main 
modules run during the simulation. The values are eval¬ 
uated for 4, 8, 12, 16 , 20 and 24 Py- For each test, we 
measured the runtime for 2 evolution steps starting from 
the afore mentioned checkpoint file. The total scaling ef- 
hciency of our gravity solver dropped down to « 73% for 
24 Py. The loss in efficiency can be attributed to the 
routines for calculating the Essential Nodes (fig. 24 


since these show the largest drop in efficiency. We trace 
this back to the increased loop size and increased memory 
operation cost as a result of the higher number of cores. 
For comparison, we run the same simulation setup with 


the GPU-BH tree solver and the Gridsolver. Figure 25 


shows the evaluated strong scaling efficiency for the respec¬ 
tive simulations. We see a super linear scaling efficiency 
of the GPU-BH tree code (hg. 25 —I and sub linear 


strong scaling of our accelerated solver (hg. 
the Gridsolver (hg. 25 


and 

The Gridsolver and our ac- 
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Figure 23: The SP-TL simulation at tg = 1.2 x 10^^ s 3.8 x 10"^ yrs. 
Plots show the column density g cm~^ for the upper collapse region 
at different resolutions. FVom left to right, the box length is reduced 
from 8.0 x 10^^ cm over 8.0 x 10^^ cm to 3.0 x 10^^ cm. Sink particles 
appear as white dots in the large image to the right. 
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Figure 25: Strong scaling efficiency for the gravity unit in % of linear 
with the SP-TS setup for our GPU accelerated gravity solver l | ♦ [ | 
the CPU-BH tree solver l | ■ [ |, and the Gridsolver l | • [ |. The 
CPU-BH tree solver shows a super linear strong scaling behavior 
l | M [ l, while our accelerated solver | | • [ l and the Gridsolver l | • [ | 
show non ideal scaling capabilities. One Pu refers to 6 CPU cores 
and 1 GPU. 
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Figure 24: Strong scaling efficiency in % of linear for different code 
parts of the GPU-BH tree solver and the major evolution routines 
executed during the SP-TS simulations. One P[j refers to 6 CPU 
cores and 1 GPU. The strong scaling efficiency of our gravity solver 
(| * || drops down to Ri 73% of the linear scaling. 


celerated solver show an almost identical scaling behavior 
with a slight advantage of « 3 % towards our solver. 
Although the strong scaling efficiency of our solver shows 
a non linear behavior, we find a reasonable performance 


gain compared to the other solvers (see fig. 26). In com¬ 
parison to the CPU-BH tree code, we find a high speedup 
of factor 19 with 4 Pjj and a speedup factor of 13 with 24 
P\j (fig. 26 —•—[). In comparison to the Gridsolver, our 


GPU accelerated solver was roughly 5 times faster for all 
numbers of Pu (fig. 


Following the speedup factor, we see a much higher data 
throughput with our accelerated solver than with the other 


solvers (see fig. 27). For our accelerated solver, we eval¬ 
uated a data throughput of « 2.0 x 10® data cells/sec 
with just 4Pu (fig. ^7\ [—•—[ )■ At the same cores, we find 
much lower data throughput of « 4.7 x 10^ cells/sec for 
the Grdsolver and « 1.0 x 10"^ cells/sec for the GPU-BH 
tree solver. We find the maximum data throughput val¬ 
ues with « 6.6 X 10"*^ cells/sec with the CPU-BH solver 
« 1.7 X 10® cells/sec with the Gridsolver and « 8.7 x 10® 
cells/sec with our accelerated GPU-BH solver for 24Pu . 
Note that even with additional 120 cores (20Pu) both the 
Gridsolver and the CPU-BH tree solver stayed below the 
« 2.0 X 10® data cells/sec mark which our solver reached 
with 4 Pu. The values reported in figure only refer to 
a GPU/GPU ratio of 1/6 and we find higher speedup fac¬ 
tors when utilizing less CPU core for each GPU. Figure [28| 
shows the calculated speedup factors for different ratios. 
We find the highest speedup of factor 63 for our GPU-BH 
tree solver compared to the GPU-BH tree solver when run 
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Figure 26: Strong scaling speedup factors of our accelerated gravity 
solver compared to the CPU-BH tree solver and the Gridsolver. The 
values are evaluated for the SP-TS simulation for the gravity unit 
only. One Pv refers to a unit of 6 CPU cores and 1 GPU. 
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Figure 28: Speedup factors of our accelerated gravity solver run on 
24 cores with different CPU/GPU ratios compared to the CPU-BH 
tree solver and the Gridsolver run on 24 and 144 CPU cores. The 
values are evaluated fort he SP-TS simulation. The X-axis refers to 
the GPU count for different GPU/CPU ratios using a total of 24 
CPU cores. 
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Figure 27: Strong scaling data throughput in data cells/sec of the 
different gravity solvers for the SP-TS simulations. With the CPU- 
BH tree solver l | B [ | and the Gridsolver ( | • | | , we reached the 
maximum data throughput of 5 x respectively 1.16 x 10^ data 
cells per second with 24 Pv A similar value of bb 1.97 x 10^ data 
cells per second is reached with our accelerated gravity solver | | • [ l 
with only 4Pu- For our accelerated solver, we reached the maximum 
of nearly 9 x 10^ data cells per second with 24 Pij. One P\j refers to 
6 CPU cores and 1 GPU. 


with the same number of CPU cores and a CPU/GPU 
ratio of 2/1 (12 GPU devices and 14 CPU Cores). Even 
when utilizing 144 (6 times more) CPU cores to the CPU- 
BH solver, we find our GPU accelerated solver to be over 9 
times faster. We measure the speedup based on the “evo¬ 
lution” time stamp in the FLASH log file. Hence we ignore 
the initialization which is negligible in production simula¬ 
tions (Cordery et al., 2014). As can be seen in fig. 28 we 


find the best speedup with a one-to-one CPU/GPU ratio 
(24 CPU Cores and 24 GPU devices). Here, we see a total 
speedup of factor 10 compared to the CPU-BH solver and 
a speedup of 3 compared to the gravity Gridsolver (see fig. 


29). 


5. Conclusions 

We described a GPU accelerated BH tree code as an 
additional Poisson solver for the FLASH4 software pack¬ 
age. Our implementation works only for three-dimensional 
problems. Furthermore, the tree walk is limited to a max¬ 
imum number of « 32 x 10® data cells for each process^] 
Three different simulation setups were used to test the ac¬ 
curacy and the performance of our novel CPU-BH code. 
Depending on the setup and GPU/CPU ratio we find a 
speedup for the gravity unit of at least a factor of 3 and 
up to 60 compared to the original CPU-BH implementa¬ 
tion. For GPU/CPU ratios below 1/6 the runtime of our 
simulations is no more dominated by the gravity unit but 


Assuming a GPU with at least 6GB of memory 
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Figure 29: Evolution step speedup factors of our accelerated grav¬ 
ity solver with varying CPU/GPU ratios and a constant number of 
CPU cores compared to the CPU-BH code | | • [ | and the Gridsolver 
j M | |. The X-axis refers to the GPU count for different GPU/CPU 
ratios using a total of 24 CPU cores. All values are evaluated for the 
SP-TS simulation run with 24 CPU cores. 


by the hydro solver. Hence we find lower speedup fac¬ 
tors between 1.6 and 10 for the total application runtime. 
We have shown that even with a small GPU/CPU ratio 
an advantageous performance gain can be achieved. The 
GPU-BH code was written for GPU-devices with a min¬ 
imum compute capability of 2.0 (Fermi architecture) and 
we expect further runtime improvements by porting the 
code onto more modern devices with a higher compute 
capability. Here, we expect improvements not only from 
higher clock rates and higher register counts, but from ad¬ 
ditional features. E.g. with a compute capability > 3.0 
the warp shuffle functions shflO could be used to reduce 
the shared memory usage in the force calculation kernels. 
Since our GPU code is written in CUBA, only NVIDIA 
GPUs are supported with the current version. A portable 
version of the GPU-BH tree code ported to OpenGL will 
be available in a later release. The GPU-BH tree solver 
for FLASH4 is available for free from the Hamburg Obser¬ 
vatory at www.hs.uni-hamburg.de/gpubh. 
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Appendix A. Periodic boundaries 

The GPU-BH tree code supports isolated and periodic 
boundary conditions. In case of periodic boundary condi¬ 
tions, a node’s or data cell’s contribution to the gravita¬ 
tional potential of a data cell dc changes to 

</ = GM/EF(rO. (A.l) 

Here, G is the gravitational constant, M is the mass of the 
node or data cell, r is the position vector between dc and 
data cell or node and /ef is the Ewald field. The original 
Ewald method is a method for computing the gravitational 
field for problems with periodic boundary conditions in 
three directions. Using the original method, the gravita¬ 
tional potential is split into two parts, 

Gmjr = Gm erf (ar)/r -|- Gm erfc(ar)/r (A.2) 

where a is an arbitrary constant. By applying Poisson 
summation formula on the erfc terms, the gravitational 
field at position r can be written in the form 

N 

$(r) = - Cy^maX 

a—1 

As(r,fa,lii,i 2 ,i 3 ) -FAi(r, fa, lii,12,13) 

(A.3) 

The first sum runs over whole computational domain, where 
mass ma is at position fa. The second sum runs over 
all neighboring computational domains, which are at posi¬ 
tions and As(f,fa,?ii,i2.i3) and AL(f, fa, ^11,1243) are 

short, resp. long-range contributions. The Ewald field is 
calculated once on startup and stored in a large array rep¬ 
resenting a hierarchy of nested grids. The evaluation of the 
Ewald Eield at certain points is only needed during the fi¬ 
nal force evaluation and carried out during the respective 
tree walk using quadratic interpolation. 



Appendix B. Usage 


The GPU-BH tree code for PLASH4 implements a set 
of runtime parameters to control the accuracy and run¬ 
time. All the parameters can be set individually using the 
common flash.par file. The opening angle parameter 0 
can be set with grv-bh_gpuLimAngle to a value between 
0.1 and 1.0. The gathered data strategy can be selected 
with setting the runtime parameter grv_bh-gpu-Concat to 1 
where the default value of 0 refers to a serialized GPU de¬ 
vice access pattern. In case periodic boundaries are used, 
the respective Ewald Eield is controlled via the runtime 
parameters gpuEwaldSeriesN which controls the range of 


the indices ii,i 2 ,i 3 in (Eq. A.3) and gpuEwaldSeriesNref 


which refers to the number of nested grids in the field ar¬ 
ray. The number of data points in each grid can be set 
with the parameters gpuEwaldEieldNx, gpuEwaldFieldNy 
and gpuEwaldFieldNz. Note, that the Ewald field resides 
in GPU memory during the hole simulation. 
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